69 Machine Learning: Workflow
69.1 Steps in an ML Project
The workflow of a machine learning project includes several critical steps:
1. Defining the problem - what is the issue, desired outcome, or question?
2. Collecting and preparing data
3. Choosing a model - which type of model is best for our problem?
4. Training the model - running the model on some training data
5. Evaluating the model - testing the model on some testing data
6. Deploying the model to production - using the model on incoming data and evaluating its effectiveness
Each step is iterative and may require going back and forth as new insights are gained and model optimisations are made.
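To make the later steps concrete, here is a minimal sketch of training, evaluating, and applying a model, assuming scikit-learn is available and using a synthetic dataset in place of real project data:

```python
# Minimal sketch of the train / evaluate / deploy steps (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the collected and prepared data (step 2)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model on the training data (step 4)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model on held-out testing data (step 5)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# In production (step 6), the fitted model would be applied to new, incoming data:
# predictions = model.predict(X_new)
```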
In this section, we’ll focus on step 2 above. The following section explores step 3.
69.2 Types of Data in ML
Data is central to ML. The quality, quantity, and relevance of the data directly influence the performance of machine learning models.
High-quality data is the fuel for the ML engine; without it, even the most sophisticated algorithms cannot function. The phrase “garbage in, garbage out” is particularly apt in ML. High-quality data leads to models that can accurately capture the underlying patterns and predict future outcomes.
Data in ML can be broadly categorised into two types: structured and unstructured.
Structured data is highly organised and easily understood by machine learning models. It’s typically tabular data with rows and columns, where each column represents a specific attribute, and each row corresponds to a data record. Examples include spreadsheets, SQL databases, and CSV files.
Unstructured data, on the other hand, lacks a predefined data model and is therefore more challenging for algorithms to interpret. This type of digital data includes text, images, audio, and video. Despite these challenges, advances in ML, particularly deep learning, have significantly improved our ability to extract meaningful information from unstructured data.
69.3 Data Pre-Processing
Before data can be used for training machine learning models, it often requires pre-processing. This stage involves several key steps designed to convert raw data into a clean, organised format suitable for ML.
Basic data cleaning tasks include handling missing values, which might be addressed by removing data points, filling them (imputation) with a statistical measure (like the mean or median), or using a model to predict the missing values.
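For instance, a minimal sketch using pandas (with made-up column names) might handle missing values like this:

```python
# Sketch of simple missing-value handling with pandas; column names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"goals": [2, np.nan, 1, 3], "shots": [10, 12, np.nan, 15]})

# Option 1: remove rows that contain any missing values
df_dropped = df.dropna()

# Option 2: impute missing values with a statistical measure (here, the column median)
df_imputed = df.fillna(df.median())
```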
Outlier detection and removal is important, as outliers can skew the results of an ML model.
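A common rule of thumb for detecting outliers is the inter-quartile range (IQR); a sketch with made-up numbers might look like this:

```python
# Sketch of IQR-based outlier removal; the data and column name are illustrative.
import pandas as pd

df = pd.DataFrame({"shots": [10, 12, 11, 13, 9, 95]})  # 95 looks suspicious

q1, q3 = df["shots"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows whose value falls within the IQR-based bounds
df_no_outliers = df[df["shots"].between(lower, upper)]
```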
Additionally, data normalisation or standardisation is commonly performed to ensure that all numerical input features have a similar scale, preventing features with larger scales from dominating those with smaller scales.
Data preprocessing also involves feature selection and feature engineering.
Feature selection involves identifying the most relevant features to use as inputs for machine learning models, reducing dimensionality and improving model efficiency.
Feature engineering involves creating new features from the existing data, enhancing the model’s ability to learn from the data by introducing additional context or combining information in novel ways.
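For feature selection specifically, one simple univariate approach is sketched below, assuming scikit-learn and a synthetic dataset:

```python
# Sketch of univariate feature selection with scikit-learn (assumed available).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Keep the 5 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 5)
```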
69.4 Preparing Data for Machine Learning
When preparing data, it’s crucial to clean the data by removing outliers and handling missing values.
Data transformation and normalisation are also key steps to ensure that the machine learning algorithms work effectively.
Recap on missing values
We’ve covered a few different techniques to deal with missing values, including:
Imputation: Filling in missing values using strategies like mean, median, or mode imputation, or more complex methods like K-Nearest Neighbors imputation.
Deletion: Removing records with missing values, especially if the missing data is extensive and imputation may introduce bias.
Alternatively, we can use algorithms that support missing values: some machine learning algorithms can handle missing values inherently, so selecting one of these can itself be a strategy.
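As a brief sketch of the more complex end of this spectrum, K-Nearest Neighbors imputation is available in scikit-learn (assuming it is installed):

```python
# Sketch of KNN imputation; each missing entry is filled using the values of
# that feature in the k most similar rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```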
Feature engineering
As mentioned above, when preparing for ML we sometimes want to create new variables from existing data. This is called ‘feature engineering’. Some common approaches to this are:
Creating Interaction Terms: Combining two or more features to create interaction terms which may have a stronger relationship with the target variable.
Polynomial Features: Adding polynomial features can help in capturing non-linear relationships in the data.
Encoding Categorical Variables: Converting categorical variables into a format that can be provided to machine learning models, like one-hot encoding.
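A small sketch covering all three approaches, using pandas and scikit-learn with purely illustrative column names, might look like this:

```python
# Sketch of interaction terms, polynomial features, and one-hot encoding.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"possession": [0.55, 0.48, 0.61],
                   "shots": [12, 9, 15],
                   "venue": ["home", "away", "home"]})

# Interaction term: combine two features into one new feature
df["possession_x_shots"] = df["possession"] * df["shots"]

# Polynomial features: add squared and cross terms of the numeric inputs
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["possession", "shots"]])

# One-hot encoding: turn the categorical column into binary indicator columns
df_encoded = pd.get_dummies(df, columns=["venue"])
```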
Data scaling and normalisation
See here for more detailed coverage of this topic, where a number of techniques were introduced including:
Standardisation (Z-score Normalisation): Transforming the data to have a mean of 0 and a standard deviation of 1.
Min-Max Scaling: Rescaling each feature to a fixed range, typically 0 to 1. For each feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
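Both techniques are available in scikit-learn; a sketch on toy data might look like this:

```python
# Sketch of standardisation and min-max scaling with scikit-learn (assumed available).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardisation: each column ends up with mean 0 and standard deviation 1
X_standardised = StandardScaler().fit_transform(X)

# Min-max scaling: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
```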
Dealing with imbalanced data
Ideally, we have balanced data, meaning that the different categories within our dataset are equally represented. For example, if dealing with multiple teams and multiple games, we would have roughly the same number of games for each team.
Some techniques to address this include:
Resampling: Either oversampling the minority class, undersampling the majority class, or both.
Synthetic Data Generation: Creating synthetic samples of the minority class (e.g., using SMOTE).
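A naive oversampling sketch, using scikit-learn's resample utility on a made-up results table, is shown below; for synthetic approaches such as SMOTE, the third-party imbalanced-learn package provides implementations.

```python
# Sketch of naive oversampling of the minority class; data and column names are illustrative.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"shots": [10, 12, 9, 14, 11, 8],
                   "won":   [0, 0, 0, 0, 1, 1]})

majority = df[df["won"] == 0]
minority = df[df["won"] == 1]

# Randomly duplicate minority rows (with replacement) until the classes are the same size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])
```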